{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#把数据调整为标准正态分布"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "经常需要将数据标准化调整（scaling）为标准正态分布（standard normal）。标准正态分布算得上是统计学中最重要的分布了。如果你学过统计，Z值表（z-scores）应该不陌生。实际上，Z值表的作用就是把服从某种分布的特征转换成标准正态分布的Z值。\n",
    "\n",
    "<!-- TEASER_END -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##Getting ready"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据标准化调整是非常有用的。许多机器学习算法在具有不同范围特征的数据中呈现不同的学习效果。例如，SVM（Support Vector Machine，支持向量机）在没有标准化调整过的数据中表现很差，因为可能一个变量的范围是0-10000，而另一个变量的范围是0-1。`preprocessing`模块提供了一些函数可以将特征调整为标准形："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn import preprocessing\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##How to do it..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "还用`boston`数据集运行下面的代码："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "boston = datasets.load_boston()\n",
    "X, y = boston.data, boston.target"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([  3.59376071,  11.36363636,  11.13677866])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X[:, :3].mean(axis=0) #前三个特征的均值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([  8.58828355,  23.29939569,   6.85357058])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X[:, :3].std(axis=0) #前三个特征的标准差"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里看出很多信息。首先，第一个特征的均值是三个特征中最小的，而其标准差却比第三个特征的标准差大。第二个特征的均值和标准差都是最大的——说明它的值很分散，我们通过`preprocessing`对它们标准化："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_2 = preprocessing.scale(X[:, :3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([  6.34099712e-17,  -6.34319123e-16,  -2.68291099e-15])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_2.mean(axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 1.,  1.,  1.])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_2.std(axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##How it works..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "中心化与标准化函数很简单，就是减去均值后除以标准差，公式如下所示：\n",
    "\n",
    "$$x= \\frac {x- \\bar x} \\sigma$$\n",
    "\n",
    "除了这个函数，还有一个中心化与标准化类，与管线命令（Pipeline）联合处理大数据集时很有用。单独使用一个中心化与标准化类实例也是有用处的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([  6.34099712e-17,  -6.34319123e-16,  -2.68291099e-15])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_scaler = preprocessing.StandardScaler()\n",
    "my_scaler.fit(X[:, :3])\n",
    "my_scaler.transform(X[:, :3]).mean(axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "把特征的样本均值变成`0`，标准差变成`1`，这种标准化处理并不是唯一的方法。`preprocessing`还有`MinMaxScaler`类，将样本数据根据最大值和最小值调整到一个区间内："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 1.,  1.,  1.])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_minmax_scaler = preprocessing.MinMaxScaler()\n",
    "my_minmax_scaler.fit(X[:, :3])\n",
    "my_minmax_scaler.transform(X[:, :3]).max(axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过`MinMaxScaler`类可以很容易将默认的区间`0`到`1`修改为需要的区间："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 3.14,  3.14,  3.14])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_odd_scaler = preprocessing.MinMaxScaler(feature_range=(-3.14, 3.14))\n",
    "my_odd_scaler.fit(X[:, :3])\n",
    "my_odd_scaler.transform(X[:, :3]).max(axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "还有一种方法是正态化（normalization）。它会将每个样本长度标准化为1。这种方法和前面介绍的不同，它的特征值是标量。正态化代码如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "normalized_X = preprocessing.normalize(X[:, :3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "乍看好像没什么用，但是在求欧式距离（相似度度量指标）时就很必要了。例如三个样本分别是向量`(1,1,0)`，`(3,3,0)`，`(1,-1,0)`。样本1与样本3的距离比样本1与样本2的距离短，尽管样本1与样本3是轴对称，而样本1与样本2只是比例不同而已。由于距离常用于相似度检测，因此建模之前如果不对数据进行正态化很可能会造成失误。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##There's more..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据填补（data imputation）是一个内涵丰富的主题，在使用scikit-learn的数据填补功能时需要注意以下两点。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###创建幂等标准化（idempotent scaler）对象"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有时可能需要标准化`StandardScaler`实例的均值和/或方差。例如，可能（尽管没用）会经过一系列变化创建出一个与原来完全相同的`StandardScaler`："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_useless_scaler = preprocessing.StandardScaler(with_mean=False, with_std=False)\n",
    "transformed_sd = my_useless_scaler.fit_transform(X[:, :3]).std(axis=0)\n",
    "original_sd = X[:, :3].std(axis=0)\n",
    "np.array_equal(transformed_sd, original_sd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###处理稀疏数据填补"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在标准化处理时，稀疏矩阵的处理方式与正常矩阵没太大不同。这是因为数据经过中心化处理后，原来的`0`值会变成非`0`值，这样稀疏矩阵经过处理就不再稀疏了："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Cannot center sparse matrices: pass `with_mean=False` instead See docstring for motivation and alternatives.",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-45-466df6030461>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mscipy\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      2\u001b[0m \u001b[0mmatrix\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mscipy\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msparse\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0meye\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m1000\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mpreprocessing\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mscale\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmatrix\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;32md:\\programfiles\\Miniconda3\\lib\\site-packages\\sklearn\\preprocessing\\data.py\u001b[0m in \u001b[0;36mscale\u001b[1;34m(X, axis, with_mean, with_std, copy)\u001b[0m\n\u001b[0;32m    120\u001b[0m         \u001b[1;32mif\u001b[0m \u001b[0mwith_mean\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    121\u001b[0m             raise ValueError(\n\u001b[1;32m--> 122\u001b[1;33m                 \u001b[1;34m\"Cannot center sparse matrices: pass `with_mean=False` instead\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    123\u001b[0m                 \" See docstring for motivation and alternatives.\")\n\u001b[0;32m    124\u001b[0m         \u001b[1;32mif\u001b[0m \u001b[0maxis\u001b[0m \u001b[1;33m!=\u001b[0m \u001b[1;36m0\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mValueError\u001b[0m: Cannot center sparse matrices: pass `with_mean=False` instead See docstring for motivation and alternatives."
     ]
    }
   ],
   "source": [
    "import scipy\n",
    "matrix = scipy.sparse.eye(1000)\n",
    "preprocessing.scale(matrix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个错误表面，标准化一个稀疏矩阵不能带`with_mean`，只要`with_std`："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<1000x1000 sparse matrix of type '<class 'numpy.float64'>'\n",
       "\twith 1000 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preprocessing.scale(matrix, with_mean=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "另一个方法是直接处理`matrix.todense()`。但是，这个方法很危险，因为矩阵已经是稀疏的了，这么做可能出现内存异常。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
